ABSTRACT

Classroom tests developed by seventh- and 
eighth-grade science teachers (n=23) and mathematics teachers (n=18) 
were analyzed by panels of content and measurement experts. The 41 
participating teachers, each of whom contributed 2 tests, completed a 
questionnaire, an interview, and 2 measures of competence in testing. 
Teachers used all major item formats in their classroom tests. 
Science teachers favored multiple-choice items and mathematics 
teachers favored computation items. Faults were found in 35 percent 
of completion items and 20 percent of multiple-choice items on 
teachers' tests. Average test quality on 6 dimensions was rated 5.0 
to 5.7 on 7-point semantic differential scales. Test quality was best 
predicted by scores on a multiple-choice measurement competency test. 
The sample of classroom tests is described, evaluated, and then 
related to teachers' training and experience, knowledge of testing 
and content of test use to learn more about this pervasive, crucial, 
and understudied type of testing. Three tables and one figure 
illustrate study findings. (SLD) 
[...] probably more teacher-made tests than he or she has
had hamburgers, yet we may know even less about the tests than about the
hamburgers. Development of such tests is hardly a franchise operation. Few teachers are
given directions or prescriptions; no organized quality control is practiced. Nevertheless, the
tests remain the primary basis for a multitude of educational decisions, including grading.
What are the characteristics of actual classroom tests, and how good are these tests
judged to be? Do teachers have sufficient professional skill in test development to turn
content knowledge into more than hamburger? (Food for thought?)

With the cooperation of a sample of science and mathematics teachers, we examined 
actual classroom tests developed by individual teachers. In addition, teachers completed two
measures of competence in testing plus a questionnaire and an interview.

Research questions 

What types of tests do teachers construct? (e.g., what types of items are used?) 
For what purposes do teachers test? 

[illegible]

Do many items contain violations of item-writing principles? 

What is the judged quality of these tests? 

Is test quality related to teacher characteristics? More specifically, does test quality relate
to teachers' measurement competence and their ability to detect faulted items, or to
experience, number of measurement courses, measurement knowledge, or adequacy
of measurement training?
METHOD

Parts of this section also appear in Boothroyd, McMorris, and Pruzek (1992) and are 
reproduced here for the convenience of the reader. 

Sample 

Seventh- and eighth-grade science and mathematics teachers were selected for the 
study. Judging from prior research, classroom testing occurs with the greatest frequency for
those grades and subjects, and such restrictions provided some degree of homogeneity. 

Strong efforts were undertaken to obtain a sample that met prespecified criteria (e.g.,
developed their own classroom tests) yet varied in terms of the independent variables of this
study (e.g., content area, experience, and type of school). Names of potential participants
were obtained from a variety of sources including graduate courses at local colleges and
universities, school districts, directors of teacher centers, teachers, and friends. Teachers
were screened by telephone to ensure that they were either provisionally or permanently
state-certified in 7th- and 8th-grade science and/or mathematics, were teaching within
their certification, had primary responsibility for constructing their own classroom tests, and
did not depend on an item manual accompanying the textbook. Only one teacher was
excluded because of not constructing his/her own classroom test items.

The 41 participating teachers represented 25 public and private school districts from
many geographic regions in the state. No more than two teachers were selected from any
one district, with one exception in which four teachers were included. The districts were
quite varied and included public (88%) and private (12%) schools in urban, suburban, and
rural settings.

Twenty-three teachers (56%) taught 7th- and 8th-grade science while 18 taught 
mathematics at this level (44%). Approximately two-thirds (68%) were permanently state
certified in their discipline while 13 (32%) had provisional certification. Female teachers
outnumbered males by nearly a two-to-one margin (63% to 37%, respectively). The degree
of teaching experience was somewhat evenly distributed, averaging 12 years but quite
variable (SD = 7.2 years).

Instrumentation

Each teacher supplied the researchers with two classroom tests which he/she had 
developed. For each test, three judges used a rating form in responding to questions of test 
characteristics and quality. In addition, each teacher devoted approximately three-and-a-half
hours to answering a multiple-choice test of measurement competence, identifying items
containing rule violations, responding to a questionnaire, and interacting in an
interview.

Test Rating Form. The rating scale was designed to describe and assess classroom tests on
six dimensions that many authors of measurement textbooks suggest are important to a test's
overall quality (e.g., Hopkins & Antes, 1985; Nitko, 1983). A preliminary version of the
rating form was pilot tested with seven participants in a doctoral-level measurement course 
who each rated two classroom tests. The resulting form, a semantic differential, contained 
39 adjective pairs. 

Given that quality ratings were desired on each of the six dimensions and that some 
of the adjective pairs were more descriptive in nature as compared to evaluative, seven 
judges were asked to classify each adjective pair as either evaluative (i.e., a characteristic 
clearly good or bad) or descriptive. Nine items were classified as descriptive and therefore 
analyzed separately. The six test dimensions, with the number of evaluative items per
dimension, were presentation/appearance (6), directions (4), length (2), content sampling (7),
item construction (6), and overall quality (5).

The scale was used by two panels of three raters each, with one panel for science 
tests, the other for mathematics tests. Each panel consisted of a measurement specialist, 
a subject-matter specialist, and a person with both measurement and subject-matter 
expertise. 

Mean ratings over items and raters were computed for each dimension and each test. 
Internal consistency reliabilities ranged from .60 for length to .98 for overall quality. 
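
The scoring step described above can be sketched in code. This is only an illustration under stated assumptions: the text does not say which internal consistency coefficient was computed, so coefficient alpha is assumed here, and the function names and the data layout are hypothetical.

```python
from statistics import mean, pvariance

def dimension_score(ratings):
    """Mean rating for one test on one dimension.

    ratings[r][i] is rater r's 7-point rating on adjective pair i
    (hypothetical layout; every rater rates the same pairs).
    """
    return mean(mean(row) for row in ratings)

def coefficient_alpha(scores):
    """Coefficient alpha across the adjective pairs of one dimension.

    scores[t][i] is the rating of test t on adjective pair i,
    already averaged over raters. Alpha is an assumed choice of
    coefficient, not one named in the text.
    """
    k = len(scores[0])
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

A dimension with only two adjective pairs, such as length, gives alpha little to work with, which is consistent with that dimension showing the lowest reported reliability.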

Measurement Competency Test (MCT). A 65-item, four-option, multiple-choice test was
developed to assess teachers' knowledge of various measurement concepts specific to
classroom testing. The test included items on test planning, types of items, item writing,
reliability, and validity.

For the 41 teachers' responses to the final 65-item test the item difficulties were 
somewhat evenly distributed. Twenty items (31%) were relatively easy (p > .7), 23 items 
(35%) had moderate difficulty (.4 to .7), and 22 items (34%) proved difficult (p < .4). All 
but two items had positive item discrimination values, with 51% (33 items) having 
discrimination indices above .33. A more complete description of the items and the 
development procedures may be found in Boothroyd et al. (1992). 
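
For readers less familiar with these item statistics, the following sketch shows one way they can be computed. The difficulty bands are those used above; the discrimination index is an assumption, since the text does not state which index was used. An upper-group minus lower-group difference is shown, and all function names are hypothetical.

```python
def item_difficulty(responses, item):
    """Proportion of examinees answering the item correctly (p).

    responses[e][i] is 1 if examinee e answered item i correctly, else 0.
    """
    return sum(r[item] for r in responses) / len(responses)

def difficulty_band(p):
    """Classify an item using the bands reported in the text."""
    if p > 0.7:
        return "easy"
    if p < 0.4:
        return "difficult"
    return "moderate"

def upper_lower_discrimination(responses, item, fraction=0.27):
    """Assumed index: p in the top-scoring group minus p in the bottom group.

    Groups are formed from total scores; the 27% default is a common
    convention, not a value taken from the study.
    """
    ranked = sorted(responses, key=sum, reverse=True)
    n = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:n], ranked[-n:]
    return (sum(r[item] for r in upper) - sum(r[item] for r in lower)) / n
```

An item with a negative value on such an index is answered correctly more often by low scorers than by high scorers, which is why the two negatively discriminating items are worth noting.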

Item Judgment Task (IJT). Teachers reviewed 32 multiple-choice and completion items 
related to junior high school science and mathematics, identifying items considered "good" 
items and items perceived as "poor" items. Violations of recommended item writing 
principles (flaws) were introduced into some of the items. The 32 items were equally 
divided between mathematics and science, and further faceted to include an equal number 
of multiple-choice and completion items. Within each of the four resulting cells, 3/4 of the 
items (12 of 16) contained a "flaw" in item construction. 

Six types of flaws were included, three in multiple-choice items and three others in 
completion items. Multiple-choice flaws included: (1) a cue repeated in both stem and 
answer, (2) the longest, most detailed option as the keyed response, and (3) options lacking 
homogeneity and plausibility. Flaws incorporated in completion items included: (1) blanks 
in either the beginning or middle of the statement, (2) nonspecific responses as possible
correct answers, and (3) omission of a nonessential word, such as a verb.

Analysis of teachers' responses to these items revealed that the greatest proportion
of items (14 items/44%) were easy (p > .7), five items (16%) had moderate difficulty (.4
to .7), and 41% (13 items) were difficult (p < .4). Two items had negative discrimination
values, 12 items (38%) had discrimination indices less than .1, and 12 items (38%) had
discrimination levels greater than .33. A more extensive description of the IJT items,
including development procedures and illustrative items, is presented in Boothroyd et al.
(1992).

Interview Protocol. A 36-question interview protocol was developed as a means for
providing some structure to the interviews and thus helping to ensure that consistent data
were acquired for each teacher. The questions were designed to collect information on five
topics: (1) the teacher's classroom testing practices and test development procedures [11
items], (2) his/her measurement training [5 items], (3) school/district policies and/or
regulations specific to testing [4 items], (4) criteria the teacher used when making judgments
concerning good/bad item decisions [3 items], and (5) the classroom tests submitted for
review [13 items]. Given that the study was exploratory in nature, some additional questions
were added for the purpose of exploring additional issues that arose during some of the
initial teacher interviews.



RESULTS 

Results are reported according to research questions.
What types of tests do teachers construct?

Information on teachers' tests was obtained by examining classroom tests they had 
developed. Of the 82 tests submitted for review (two tests per teacher), 64 (78%) were unit 
or chapter tests, 17 were midterm/final examinations (21%), and one (1%) was a quiz. The 
number of days of content the tests were designed to cover ranged from two days to 200 
days. The average number of items on a unit test was 40 (SD = 32.6) while this figure was 
91 items for midterms and finals (SD = 45.5). The teachers indicated that the unit/chapter 
tests tend not to be cumulative (i.e., do not contain material from previous tests) while 
midterms and finals typically cover all previously presented material. Both unit/chapter and 
midterm/final tests are typically administered to multiple classes as indicated by an average of 
67 students per unit or chapter test and 86 students per midterm or final. 

In Table 1 the tests are described by item type. According to both the teachers' 
self-report estimates and the second author's independent analysis of their tests, computation
items were the most popular for mathematics teachers and multiple-choice items for science
teachers. Further, many formats were used by each set of teachers; indeed, with a more
liberal definition of essays to include extended computational items, all these major types
of items were used by each set of teachers.
For what purposes do teachers test?

Teachers' responses to the interview question "Why do you test?"
revealed four primary reasons in addition to a number of secondary considerations. Most
frequently cited, by a majority of the teachers (69%), was the response "to [...] students'
mastery and understanding of the content taught in class."

The remaining three primary reasons were cited much less frequently, albeit with
similar frequency to each other. Instructional reasons were cited by 33 percent of the
teachers, who reported that students' performance on classroom tests provides them with an
[...]. [...] mentioned by 31% of the teachers. Many of these teachers did not
[...] but [...].

Motivation was the fourth basic reason teachers offered for testing, and was cited by
28% of the teachers. These teachers believed that students would not do the assigned
readings or seriously study the course material if tests were not given. Many of the teachers
[...] in many respects to other activities
they do during class but are treated in a more formal manner by both students and teachers.
As such, students perceive the tests as more important than other classroom activities, take
them more seriously, and prepare for them to a greater extent.

[illegible]

Over half of the teachers (54%) indicated they generally develop some form of test
plan prior to constructing a test. Although these plans are typically not formal [...],
[...] of the topics to be
covered on the test. Most of the teachers indicated their planning process involves
[...] to develop test items or
select items from other sources to assess each of the topics. Slightly over one-third of the
teachers (34%) indicated that they weight topics by varying the number of items per topic.
The procedures teachers described for deciding on weights for topics involved either taking
into account the amount of time that was spent in class on specific topics or the [...]
importance of specific material. In either case, these teachers indicated that they had
more items on the test for topics which they deemed more important or for which they
devoted a greater amount of class time.
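
As an illustration of weighting topics by class time, a test plan of this kind could be worked out as follows. This is a hypothetical sketch, not a procedure any teacher in the study reported using: the function name, the topic names, and the largest-remainder rounding rule are all assumptions.

```python
def allocate_items(class_days, total_items):
    """Split total_items across topics in proportion to class time.

    class_days maps each topic (hypothetical names) to days of instruction.
    Fractional allocations are rounded with a largest-remainder rule so that
    the counts always sum to total_items.
    """
    total_days = sum(class_days.values())
    exact = {t: total_items * d / total_days for t, d in class_days.items()}
    counts = {t: int(x) for t, x in exact.items()}
    leftover = total_items - sum(counts.values())
    # Hand the remaining items to the topics with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda t: exact[t] - counts[t], reverse=True)
    for topic in by_remainder[:leftover]:
        counts[topic] += 1
    return counts

# Example with made-up topics and days of instruction for a 40-item unit test.
plan = allocate_items({"cells": 5, "genetics": 3, "ecology": 2}, 40)
```

Weighting by judged importance instead of class time would use the same arithmetic with importance weights in place of days.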

Many of the teachers reported during the interviews that they use different item
formats for different types of content. These teachers indicated that the item format they
used was most generally related to the cognitive level of the item. In science, for example,
a number of teachers reported using alternate response (i.e., true/false [...]) and
matching items for lower cognitive-level items, such as concept definitions [...],
while essay items were used to assess higher cognitive levels such as synthesis. Some of the
teachers also distinguished between item formats requiring recognition (i.e., matching,
[...]) and those item formats necessitating recall (i.e.,
completion, short answer). Few teachers, however, indicated how they "balance" their
classroom tests with respect to the issue of cognitive level.

Do many items contain violations of item-writing principles?

[...] completion items submitted by
[...], with flaws detected in 35% of the completion and 20% of the
multiple-choice items. Most frequently observed problems in the completion items were
blanks in the beginning or middle of a statement (25% of all completion items) and the
request for a nonspecific response (14%). Nonhomogeneity of response options [...]
multiple-choice items reviewed, with the same percentage [...]
longest option as the key. Cues were discovered in five percent of the items. [...]
What was the judged quality of these tests? 

[...] three-judge panel using semantic-differential items.
Panelists [...] dimensions, judging appearance the
highest (mean = 5.7) [...] and test length the lowest (mean = 5.0) (see Table
2). [...] the greatest variability in appearance and in adequacy of directions (SDs = 1.3), and the
least variability in item construction and adequacy of content sampling (SDs = .8).

Is test quality related to teacher characteristics? 

[...] (MCT) and on the Item Judgment Task (IJT). For the 21 teachers in the bottom half
on rated test quality, [...] were in the top half on both predictors. For 20 teachers in the
top half on rated test
quality, 10 were in the top half on both predictors; 2 were in the bottom half on both.
These relationships are detailed in Figure 1.
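
A cross-classification of this kind can be sketched as below. The text does not say how the halves were formed, so a split at the median is assumed, with ties at the median placed in the bottom half; the function names are hypothetical.

```python
from statistics import median

def in_top_half(scores):
    """True where a score is above the median (assumed splitting rule)."""
    cut = median(scores)
    return [s > cut for s in scores]

def cross_classify(quality, mct, ijt):
    """Count teachers by quality half and by number of predictors in the top half.

    quality, mct, and ijt are parallel lists with one entry per teacher.
    In each returned list, index 0, 1, or 2 is the number of predictors
    (MCT, IJT) on which the teacher falls in the top half.
    """
    table = {"top": [0, 0, 0], "bottom": [0, 0, 0]}
    halves = zip(in_top_half(quality), in_top_half(mct), in_top_half(ijt))
    for q_top, mct_top, ijt_top in halves:
        table["top" if q_top else "bottom"][mct_top + ijt_top] += 1
    return table
```

In this layout, the count of top-quality teachers who were in the top half on both predictors is `table["top"][2]`.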



These two extreme groups differ in teaching experience and measurement 
background, as noted in Table 3. The high group is the somewhat more experienced group. 
For each of the three measurement variables, the high-group mean exceeds the low-group 
mean by more than half a standard deviation. 
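
A difference "in standard deviation units" can be computed as in the sketch below. The text does not say which standard deviation served as the yardstick, so a pooled within-group standard deviation is assumed, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def standardized_difference(high, low):
    """(mean of high group - mean of low group) / pooled SD.

    The pooled SD is an assumption; the study may have used the
    total-sample SD instead.
    """
    n1, n2 = len(high), len(low)
    pooled_var = ((n1 - 1) * variance(high) + (n2 - 1) * variance(low)) / (n1 + n2 - 2)
    return (mean(high) - mean(low)) / sqrt(pooled_var)
```

A returned value above 0.5 would correspond to the "more than half a standard deviation" described above.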



DISCUSSION 



Guttman (1970) expressed in a classic cartoon the imbalance of research emphasis 
on test design vs. test analysis. Similarly, study of classroom tests and their developers has 
lagged behind study of standardized, published measures. Classroom testing is the basis for 
such a variety of decisions involving instruction, grading, and other uses, yet as professionals 
we know little about the qualities and characteristics of such tests. We have done little to 
describe, let alone evaluate, these evaluative devices. 

Every day, the number of tests taken in schools, and the number and type of 
decisions based on information from those tests, could perhaps best be described graphically 
by an astronomy professor from Cornell. And if we include the other types of assessment 
information used by teachers and students (see, e.g., Airasian, 1991; Stiggins, Conklin, & 
Bridgeford, 1986), the amount of information, the number of decisions, and the impact of 
those decisions become virtually incomprehensible. Especially given that teachers' training
in formal testing is so limited, and their training in informal assessment is even more
limited, we are concerned about 1) the quality of the measures, 2) the ability of the teaching 
professionals to provide professional interpretations of information and appropriate 
decisions using that information, and 3) our own ability and resolve to formulate and 
respond to educationally important questions. 

Item types used by the science teachers in our study agree with item types found in
junior high science tests by Fleming and Chambers (1983, p. 33). Rank-order of occurrence
is the same across studies: multiple choice was most popular, followed by matching, short
answer/completion, true/false, and essay. For teachers more generally, however, Fleming
and Chambers found the short answer/completion format most popular and matching a
distant second.

For our sample, 20% of the multiple-choice items contained faults. Similarly, in the 
Oescher and Kirby (1990) study, "Of the 18 tests containing multiple choice items, 17 were 
judged to have flaws in more than 20% of these items" (p. 13). Carter (1986) also found 
faults in teacher-made tests. Of the tests Carter reviewed, 78% strongly favored the key in 
£, 86% had at least one item with a longer correct answer, 47% contained at least one stem 
cue, and 58% contained at least one grammatical clue. 

But what are the impacts of item faults on teacher-made tests? Certainly items may 
be made easier by faults (Dunn & Goldstein, 1959; McMorris, Brown, Snyder, & Pruzek,
1972; Haladyna & Downing, 1989a, 1989b). Tests containing item faults are inconsistent
with Nitko's (1983) principle that "test items should elicit only the behaviors which the test
developer desires to observe" (p. 141). We would expect faulted items to introduce
extraneous variance; such variance would, in turn, reduce somewhat the validity of 
descriptions and decisions based on the test. 

Other, more subtle impacts are also possible. Students judge tests and their 
developers. Do you expect them to respect a bogus test or an incompetent test developer? 
How many times did your attitude about a teacher or professor change as a result of taking 
your first test in a course? To illustrate, how do you felt about a author who make 
grammatical errors? And on how many other dimensions would you as a student have been 
able to describe and discuss a teacher's test? Would you have considered easiness, content 
balance, and understanding or application vs. pedestrian knowledge? The teacher
communicates so much with a test. Student attitude toward the course, the instructor, and
the subject must be affected by that test and its interpretation. 

Classroom evaluation affects students in many ways. For instance, it guides
their judgment of what is important to learn, affects their motivation [...] and
timing of personal study (e.g., spaced practice), consolidates learning, and
affects the development of enduring learning strategies and skills. It appears
to be one of the most potent forces influencing education. (Crooks, 1988, p.
467)

The impacts of a test's characteristics and quality, then, are not just in producing appropriate 
or extraneous variance on the measure itself. The impacts also include student attitudes and 
perceptions which affect what they bring to the next encounter of an evaluation kind. 

One disheartening, anecdotal index of teacher frustration and student achievement 
levels came from the teacher interviews in this study. Some teachers admitted they 
intentionally included clues in items so some weaker students could answer some items
correctly. Admittedly, if done with a sense of humor on an informal "test" that is essentially
intended for review, there may easily be some positive benefit. If done when a less
contaminated measure is desired, the extraneous variance may be expensive. At a
minimum, intentional use of clues can be investigated in further studies.

Additional samples of teachers would provide appropriate replication. We would
recommend including outcome measures assessing characteristics/quality of teacher-made
tests and independent measures for measurement competency, measurement training,
experience, etc. Extensions to our instruments could better specify knowledge of teachers'
ability and practice in grading, reporting/communicating, sizing up, instructional pacing, and
performance testing. Understanding how item characteristics and score distributions should
follow from type of objective could also be tested (see Terwilliger, 1989).

An outcome of our profession's lack of emphasis on classroom assessment may be 
to allow standardized testing to win by default. As noted by Stiggins et al. (1986),
laypersons and policymakers maintain that schooling outcomes are measured best and fairest 
by standardized paper and pencil tests, which severely restricts the variety of outcomes used 
for accountability. Similarly, research on teaching has also depended excessively on 
standardized tests to represent school achievement. Such tests are not constructed to be
maximally sensitive to instruction (Hanson, McMorris, & Bailey, 1986; Mehrens & Phillips,
1987). Issues concerning and techniques for assessing fit between test and curriculum are
reviewed by Crocker, Miller, and Franks (1989).

Relationships of published achievement tests with instruction are being examined in
more sophisticated ways, and additional research is needed. Such investigations [...]
have applicability to local districts and enhance the assessment of student achievement
[...]